A Computational Model of Word Learning from Multimodal Sensory Input

Author

  • Deb Roy
Abstract

How do infants segment continuous streams of speech to discover the words of their language? Current theories emphasize the role of acoustic evidence in discovering word boundaries (Cutler 1991; Brent 1999; de Marcken 1996; Friederici & Wessels 1993; see also Bolinger & Gerstman 1957). To test an alternate hypothesis, we recorded natural infant-directed speech from caregivers engaged in play with their pre-linguistic infants centered around common objects. We also recorded the visual context in which the speech occurred by capturing images of these objects. We analyzed the data using two computational models: one which processed only acoustic recordings, and a second which integrated acoustic and visual input. The models were implemented using standard speech and vision processing techniques, enabling them to process sensory data directly. We show that using visual context in conjunction with spoken input dramatically improves learning compared with using acoustic evidence alone. These results demonstrate the power of inter-modal learning and suggest that infants may use evidence from visual and other non-acoustic context to aid speech segmentation and spoken word discovery.

Introduction

Around their first birthday, infants begin to use words¹ which refer to salient aspects of their environment, including objects, actions, and people. They learn these words by attending to the sights, sounds, and other sensations around them. The acquisition process is complex. Infants must segment spoken input into units which correspond to the words of their language. They must also identify semantic categories which correspond to the meanings of these words. Remarkably, infants are capable of all of these processes despite the continuous variation of natural phenomena and the noisy input provided by their perceptual systems.

This paper presents a computational model of early word learning which addresses three interrelated problems: (1) segmentation of fluent speech without a lexicon in order to discover spoken words, (2) categorization of context corresponding to the referents of words, and (3) establishment of correspondence between spoken words and contextual categories. These three problems are treated as different facets of one underlying problem: discovering structure across spoken and visual input.²

¹ The term "word" is used throughout this paper in accordance with Webster's Dictionary: "A speech sound or combination of sounds having meaning and used as a basic unit of language and human communication."

² In this paper we discuss only learning from audio-visual input. The underlying model is able to learn from any combination of input modes, i.e. it is not dependent on speech or vision. See (Roy 1999) for more details.

This model has been implemented using standard speech and vision processing techniques and is able to learn from microphone and camera input (Roy 1999; Roy 2000). We used the model to evaluate the benefit of inter-modal structure for the problem of speech segmentation and word discovery. To gauge the relative usefulness of integrating visual context, we also implemented a uni-modal system which discovered words based on acoustic analysis alone (i.e., without access to visual input). Our evaluations demonstrate that dramatic gains in performance are attained when inter-modal information is leveraged. These results suggest that infants would also benefit from attending to multimodal input during even the earliest phases of speech segmentation and spoken word discovery. This work differs from previous computational models of language learning (e.g., Gorin 1995; Feldman et al. 1996; Siskind 1996) in that both linguistic and contextual input are derived from physical sensors rather than from human-generated symbolic abstractions.

CELL: A Model of Learning from Audio-Visual Input

We have developed a model of cross-channel early lexical learning (CELL), summarized in Figure 1. This model discovered words by searching for segments of speech which reliably predicted the presence of visually co-occurring shapes. Input consisted of spoken utterances paired with images of objects. This approximated the input that an infant might receive when listening to a caregiver while visually attending to objects in the environment.

A speech processor converted spoken utterances into sequences of phoneme probabilities. We built in the ability to categorize speech into phonemic categories since similar abilities have been found in pre-linguistic infants after exposure to their native language (Kuhl et al. 1992; Werker & Tees 1983). At a rate of 100 Hz, this processor computed the probability that the past 20 milliseconds of speech belonged to each of 39 English phoneme categories or silence. Phoneme estimation was achieved by training a recurrent artificial neural network similar to (Robinson 1994). The network was trained on a database of phonetically transcribed speech recordings of adult native English speakers (Seneff & Zue 1996). Utterance boundaries were automatically located by detecting stretches of speech separated by silence.
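As a rough, hypothetical sketch of the front end described above (the actual system used a recurrent neural network phoneme estimator), the fragment below assumes 20 ms analysis windows computed 100 times per second and locates utterance boundaries by detecting sustained stretches of low-energy frames. The energy threshold, the 300 ms minimum silence duration, and all function names are our own choices, not values reported in the paper.

```python
import numpy as np


def frame_signal(wave, sample_rate, frame_rate=100, window_sec=0.020):
    """Slice a mono waveform into 20 ms windows computed 100 times per second
    (10 ms hop, overlapping windows), matching the frame rate in the text."""
    hop = sample_rate // frame_rate
    win = int(sample_rate * window_sec)
    n_frames = max(0, (len(wave) - win) // hop + 1)
    if n_frames == 0:
        return np.empty((0, win))
    return np.stack([np.asarray(wave[i * hop:i * hop + win], dtype=float)
                     for i in range(n_frames)])


def segment_utterances(wave, sample_rate, silence_thresh=1e-3, min_silence_frames=30):
    """Locate utterance boundaries by detecting stretches of silence between
    stretches of speech. Returns (start_frame, end_frame) pairs; the energy
    threshold and the 300 ms minimum silence run are assumed values."""
    energy = (frame_signal(wave, sample_rate) ** 2).mean(axis=1)
    silent = energy < silence_thresh
    utterances, start, run = [], None, 0
    for i, is_silent in enumerate(silent):
        if not is_silent:
            if start is None:
                start = i                                 # first speech frame of an utterance
            run = 0
        else:
            run += 1
            if start is not None and run >= min_silence_frames:
                utterances.append((start, i - run + 1))   # close the utterance
                start = None
    if start is not None:
        utterances.append((start, len(silent)))
    return utterances
```

In the full model each frame would additionally be mapped to a probability distribution over the 39 phoneme categories plus silence; only the silence-based utterance segmentation is sketched here.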
A visual processor was developed to extract statistical representations of shapes from images of objects. The visual processor used "second order statistics" to represent object appearance, as suggested by theories of early visual processing (Julesz 1971). In a first step, edge pixels of the viewed object were located. For each pair of edge points, the normalized distance and the relative angle between the two points were computed. All distances and angles were accumulated in a two-dimensional histogram representation of the shape (the second-order statistics). Three-dimensional shapes were represented with a collection of two-dimensional shape histograms, each derived from a particular view of the object.

To gather visual data for evaluation experiments, a robotic device was constructed to capture images of objects automatically (Figure 2). The robot took images of stationary objects from various vantage points. Each object was represented by 15 shape histograms derived from images taken from 15 arbitrary poses of the robot. The chi-squared divergence statistic was used to compare shape histograms, a measure which has been shown to work well for object comparison (Schiele & Crowley 1996). Sets of images were compared by summing the chi-squared divergences of the four best matches between individual histograms.
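The following is a minimal sketch of the shape representation and matching described above, with several assumptions of our own: edge points are taken as given (no edge detector is shown), the "relative angle" is computed from the positions of the two edge points rather than from measured edge orientations, and the histogram bin counts are illustrative. Only the four-best-matches comparison rule comes directly from the text.

```python
import numpy as np


def shape_histogram(edge_points, n_dist_bins=8, n_angle_bins=8):
    """Second-order statistics of a silhouette: every pair of edge points
    contributes a (normalized distance, relative angle) entry to a 2-D
    histogram. Bin counts are illustrative assumptions."""
    pts = np.asarray(edge_points, dtype=float)          # (N, 2) pixel coordinates
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.hypot(diffs[..., 0], diffs[..., 1])
    angles = np.mod(np.arctan2(diffs[..., 1], diffs[..., 0]), np.pi)
    iu = np.triu_indices(len(pts), k=1)                 # each unordered pair once
    d, a = dists[iu], angles[iu]
    d = d / d.max()                                     # normalize distances to [0, 1]
    hist, _, _ = np.histogram2d(d, a, bins=(n_dist_bins, n_angle_bins),
                                range=[[0, 1], [0, np.pi]])
    return hist / hist.sum()                            # counts -> probabilities


def chi_squared_divergence(h1, h2, eps=1e-9):
    """Chi-squared divergence between two normalized shape histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))


def compare_objects(views_a, views_b, n_best=4):
    """Compare two objects, each a set of view histograms, by summing the
    chi-squared divergences of the four best-matching histogram pairs."""
    scores = sorted(chi_squared_divergence(a, b)
                    for a in views_a for b in views_b)
    return sum(scores[:n_best])
```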
Figure 1: The CELL model. Camera images of objects are converted to statistical representations of shapes. Spoken utterances captured by a microphone are mapped onto sequences of phoneme probabilities. The short term memory (STM) buffers phonetic representations of recent spoken utterances paired with representations of co-occurring shapes. A short-term recurrence filter searches the STM for repeated sub-sequences of speech which occur in matching visual contexts. The resulting pairs of speech segments and shapes are placed in a long term memory (LTM). A filter based on mutual information searches the LTM for speech-shape pairs which usually occur together and rarely occur apart within the LTM. These pairings are retained in the LTM; rejected pairings are periodically discarded by a garbage collection process.

Phonemic representations of multi-word utterances and co-occurring visual representations were temporarily stored in a short term memory (STM). The STM had a capacity of five utterances, corresponding to approximately 20 words of infant-directed speech. As input was fed into the model, each new [utterance, shape] entry replaced the oldest entry in the STM. A short-term recurrence filter searched the contents of the STM for recurrent speech segments which occurred in matching visual contexts. The STM focused initial attention on input which occurred closely in time. By limiting analysis to a small window of input, the computational resources required for search and the memory required for unanalyzed sensory input are minimized, as is required for cognitive plausibility.

To determine matches, an acoustic distance metric was developed (Roy 1999) to compare each pair of potential speech segments drawn from the utterances stored in the STM. This metric estimated the likelihood that the two segments were variations of the same underlying phoneme sequence and thus represented the same word. The chi-squared divergence metric described earlier was used to compare the visual components associated with each STM utterance. If both the acoustic and the visual distances were small, the segment and shape were copied into the LTM. Each entry in the LTM represented a hypothesized prototype of a speech segment and its visual referent.

Figure 2: A robot was built to capture images of objects from multiple vantage points. The schematic on the right shows the five degrees of freedom of the imaging system, including a turntable for rotating objects. As can be seen from the photograph on the left, the system was designed as a synthetic character to experiment with notions of embodied human-computer interfaces (see Roy 1999; Roy et al. 1997).

Infant-directed speech usually refers to the infant's immediate context (Snow 1977). When speaking to an infant, caregivers rarely refer to objects or events which are in another location or which happened in the past. Guided by this fact, a long-term mutual information filter assessed the consistency with which speech-shape pairs co-occurred in the LTM. The mutual information (MI) between two random variables measures the amount of uncertainty removed regarding the value of one variable given the value of the other (Cover & Thomas 1991). MI was used here to measure the amount of uncertainty removed about the presence of a specific shape in the learner's visual context given the observation of a specific speech segment. Since MI is a symmetric measure, the converse was also true: it measured the uncertainty removed about the occurrence of a particular speech segment given a visual context. Speech-shape pairs with high MI were retained, and a garbage collection process periodically removed hypotheses from the LTM which did not encode high-MI associations.
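To make the long-term mutual information filter concrete, the sketch below computes the MI (in bits) between two binary events, "speech segment observed" and "shape observed", from co-occurrence counts over LTM observations, and discards hypotheses whose MI falls below a threshold. The counting scheme, the data layout, and the threshold value are assumptions of ours; the paper specifies only that mutual information was the retention criterion.

```python
import math


def mutual_information(n_both, n_speech_only, n_shape_only, n_neither):
    """Mutual information (bits) between two binary events -- 'speech segment
    observed' and 'shape observed' -- estimated from co-occurrence counts."""
    n = n_both + n_speech_only + n_shape_only + n_neither
    if n == 0:
        return 0.0
    cells = [
        # (joint count, marginal count for the speech value, marginal for the shape value)
        (n_both,        n_both + n_speech_only,   n_both + n_shape_only),
        (n_speech_only, n_both + n_speech_only,   n_speech_only + n_neither),
        (n_shape_only,  n_shape_only + n_neither, n_both + n_shape_only),
        (n_neither,     n_shape_only + n_neither, n_speech_only + n_neither),
    ]
    mi = 0.0
    for joint, m_speech, m_shape in cells:
        if joint == 0:
            continue
        mi += (joint / n) * math.log2(joint * n / (m_speech * m_shape))
    return mi


def garbage_collect(ltm, mi_threshold=0.1):
    """Keep only LTM hypotheses whose speech-shape association carries enough
    mutual information; each entry is assumed to hold its own 2x2 counts."""
    return [h for h in ltm if mutual_information(*h["counts"]) >= mi_threshold]
```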
RECUR: A Model of Learning from Acoustic Input

For comparative purposes, we developed a second model, RECUR, which segmented speech using only acoustic information (Figure 3). The acoustic processing in RECUR was identical to that in CELL, allowing us to compare the two models on the same evaluation data. RECUR discovered words by searching for recurrent sequences of speech sounds. The underlying idea of the model, common in current theories of speech segmentation (Brent 1999; de Marcken 1996), is that the learner views language as the output of an underlying process which concatenates words to generate utterances. By noticing subsequences of speech which recur often, the learner can detect common words and begin to segment fluent speech at word boundaries.

Figure 3: The RECUR model. Acoustic waveforms recorded by a microphone are converted to phoneme probabilities. Utterances are buffered by a short term memory (STM) and provide input to a recurrence filter which searches for repeated sequences of speech within the STM. The result is a set of speech segments which are stored in the long term memory (LTM). A second recurrence filter searches for entries in the LTM which are repeated often across long spans of time. Such repetition is evidence that a segment represents a word of the target language, and the segment is retained in the LTM. A garbage collection process periodically removes segments from the LTM which fail to pass the long-term recurrence filter.

Infants are unlikely to search for all possible matches of speech segments across all spoken utterances they have ever heard. Such recurrence analysis would require huge amounts of memory for verbatim speech and would demand impractical computational resources. As suggested by theories of human memory (Miller 1956), our model eases the resource requirements by first searching for recurrent phonemic sequences in a short-term window of input. The model performed an exhaustive search for repeated segments in the STM each time a new utterance was added. Recurrent speech sequences were extracted from the STM and copied into the LTM. A second recurrence detector compared all LTM segments to one another using the same acoustic distance metric used on the STM. Segments in the LTM which were phonemically similar to many other LTM segments were retained as reliable word candidates. Periodically, unlikely hypotheses which did not match other entries in the LTM were removed by a garbage collection process.
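The sketch below caricatures RECUR's two-stage recurrence filtering. It operates on utterances encoded as phoneme strings and uses exact substring matching as a stand-in for the paper's probabilistic acoustic distance metric; the five-utterance buffer size is borrowed from the description of CELL's STM, and the minimum segment length and repeat threshold are our own assumptions.

```python
from collections import Counter, deque


def shared_segments(u1, u2, min_len=3):
    """All substrings of u1 (length >= min_len) that also occur in u2.
    Exact matching stands in for the probabilistic acoustic distance metric."""
    found = set()
    for i in range(len(u1)):
        for j in range(i + min_len, len(u1) + 1):
            if u1[i:j] in u2:
                found.add(u1[i:j])
            else:
                break        # longer extensions of a failed match cannot occur either
    return found


class Recur:
    """Two-stage recurrence filtering over phoneme strings (illustrative only)."""

    def __init__(self, stm_capacity=5):
        self.stm = deque(maxlen=stm_capacity)   # recent utterances
        self.ltm = Counter()                    # candidate segments -> repeat counts

    def observe(self, utterance):
        """Short-term recurrence filter: compare the new utterance against every
        utterance still in the STM and promote repeated segments to the LTM."""
        for previous in self.stm:
            for segment in shared_segments(utterance, previous):
                self.ltm[segment] += 1
        self.stm.append(utterance)

    def garbage_collect(self, min_repeats=3):
        """Long-term recurrence filter: retain only segments that recur often
        across long spans of time; discard the rest."""
        self.ltm = Counter({s: c for s, c in self.ltm.items() if c >= min_repeats})
        return sorted(self.ltm)
```

In the full models, a promoted segment would carry a phoneme probability sequence (and, in CELL, a co-occurring shape histogram) rather than a literal string, and both filtering stages would use the acoustic distance metric described above.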
